Abstract: Our goal is to quickly find top k lists of nodes with the largest degrees in large 
complex networks. If the adjacency list of the network is known (not often the case in complex 
networks), a deterministic algorithm to find a node with the largest degree requires an average 
complexity of O(n), where n is the number of nodes in the network. Even this modest complexity 
can be very high for large complex networks. We propose to use a random walk based method. 
We show theoretically and by numerical experiments that for large networks the random walk 
method finds good quality top lists of nodes with high probability and with computational savings 
of orders of magnitude. We also propose stopping criteria for the random walk method that 
require very little knowledge about the structure of the network. 
Quick Detection of Nodes with Large Degrees 

Summary: Our goal is to quickly find, in large complex networks, top k lists of nodes with 
the largest degrees. If the adjacency list of the network is known (not often the case in 
complex networks), a deterministic algorithm to find a node with the largest degree requires 
an average complexity of O(n), where n is the number of nodes in the network. Even this modest 
complexity can be very high for large complex networks. We propose to use a method based on 
the random walk. We show theoretically and by numerical experiments that for large networks 
the random walk method finds good quality top k lists with a high probability of success and 
with computational savings of several orders of magnitude. We also propose stopping criteria 
for the random walk method that require no knowledge of the structure of the network. 

Keywords: complex networks, detection of nodes with the largest degrees, top k list, 
random walk, stopping criteria 
1 Introduction 

We are interested in quickly detecting nodes with large degrees in very large networks. Firstly, 
node degree is one of centrality measures used for the analysis of complex networks. Secondly, 
large degree nodes can serve as proxies for central nodes corresponding to the other centrality 
measures, such as betweenness centrality or closeness centrality [5]. In the present work we restrict 
ourselves to undirected networks or symmetrized versions of directed networks. In particular, 
this assumption is well justified in social networks. Typically, friendship or acquaintance is 
a symmetric relation. If the adjacency list of the network is known (not often the case in 
complex networks) , the straightforward method that comes to mind is to use one of the standard 
sorting algorithms like Quicksort or Heapsort. However, even their modest average complexity, 
O(n log(n)), can be very high for very large complex networks. In the present work we suggest 
using random walk based methods for detecting a small number of nodes with the largest degree. 
The main idea is that the random walk very quickly comes across large degree nodes. In our 
numerical experiments random walks outperform the standard sorting procedures by orders of 
magnitude in terms of computational complexity. For instance, in our experiments with the web 
graph of the UK domain (about 18 500 000 nodes) the random walk method spends on average 
only about 5 400 steps to detect the largest degree node. Potential memory savings are also 
significant since the method does not require knowledge of the entire network. In many practical 
applications we do not need a complete ordering of the nodes and even can tolerate some errors 
in the top list of nodes. We observe that the random walk method obtains many nodes in the top 
list correctly and even those nodes that are erroneously placed in the top list have large degrees. 
Therefore, as typically happens in randomized algorithms, we trade off exact results for 
very good approximate results or for exact results with high probability and gain significantly in 
computational efficiency. 

The paper is organized as follows: in the next section we introduce our basic random walk 
with uniform jumps and demonstrate that it is able to quickly find large degree nodes. Then, 
in Section 3, using the configuration model, we provide an estimate for the necessary number of 
steps for the random walk. In Section 4 we propose stopping criteria that use very little information 
about the network. In Section 5 we show the benefits of allowing a few erroneous elements in the 
top k list. Finally, we conclude the paper in Section 6. 

2 Random walk with uniform jumps 

Let us consider a random walk with uniform jumps which serves as a basic algorithm for quick 
detection of large degree nodes. The random walk with uniform jumps is described by the 
following transition probabilities [1]: 

$$
p_{ij} =
\begin{cases}
\dfrac{a/n + 1}{d_i + a}, & \text{if } i \text{ has a link to } j, \\[2mm]
\dfrac{a/n}{d_i + a}, & \text{if } i \text{ does not have a link to } j,
\end{cases}
\tag{1}
$$

where $d_i$ is the degree of node i. The random walk with uniform jumps can be regarded as a 
random walk on a modified graph where all the nodes in the graph are connected by artificial 
edges with a weight a/n. The parameter a controls the rate of jumps. Introduction of jumps 
helps in a number of ways. As was shown in [1], it reduces the mixing time to stationarity. It also 
solves a problem encountered by a random walk on a graph consisting of two or more components, 
namely the inability to visit all nodes. The random walk with jumps also reduces the variance 
of the network function estimator [1]. This random walk resembles the PageRank random walk. 
However, unlike the PageRank random walk, the introduced random walk is reversible. One 
important consequence of the reversibility of the random walk is that its stationary distribution 
is given by a simple formula 

$$
\pi_i(a) = \frac{d_i + a}{2|E| + na}, \qquad i = 1, \dots, n, \tag{2}
$$

where |E| is the number of links, from which the stationary distribution of the original random 
walk can easily be retrieved. We observe that the modification preserves the order of the nodes' 
degrees, which is particularly important for our application. 
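The relation between the transition probabilities (1) and the stationary distribution (2) can be checked numerically on a small graph. The following is a minimal sketch, not the authors' code: the function names and the five-node toy graph are hypothetical, and the graph is assumed to be undirected, without self-loops, and stored as a list of neighbour sets. Exact rational arithmetic is used so that the identities hold without rounding.

```python
from fractions import Fraction


def transition_matrix(adj, a):
    """Transition matrix of the walk with uniform jumps, following (1).

    adj: list of neighbour sets of an undirected graph without self-loops.
    a:   jump parameter (non-negative, with d_i + a > 0 for every node).
    """
    a = Fraction(a)
    n = len(adj)
    return [
        [(a / n + (1 if j in adj[i] else 0)) / (len(adj[i]) + a) for j in range(n)]
        for i in range(n)
    ]


def stationary_distribution(adj, a):
    """Stationary distribution (2): pi_i(a) = (d_i + a) / (2|E| + n a)."""
    a = Fraction(a)
    n = len(adj)
    two_E = sum(len(neigh) for neigh in adj)  # the sum of degrees equals 2|E|
    return [(len(neigh) + a) / (two_E + n * a) for neigh in adj]


# Hypothetical toy graph with two components: the path 1-0-2 and the edge 3-4.
adj = [{1, 2}, {0}, {0}, {4}, {3}]
P = transition_matrix(adj, 2)
pi = stationary_distribution(adj, 2)
```

On this toy graph every row of `P` sums to one and `pi` satisfies the balance equations exactly, even though the graph is disconnected; the node with the largest degree also has the largest stationary probability.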

We illustrate on several network examples how the random walk helps us quickly detect large 
degree nodes. We consider as examples one synthetic network generated by the preferential 
attachment rule and two natural large networks. The Preferential Attachment (PA) network 
contains 100 000 nodes. It has been generated according to the generalized preferential 
attachment mechanism. The average degree of the PA network is two and the power law exponent 
is 2.5. The first natural example is the symmetrized web graph of the whole UK domain crawled 
in 2002 [3]. The UK network has 18 520 486 nodes and its average degree is 28.6. The second 
natural example is the network of co-authorships of DBLP. Each node represents an author 
and each link represents a co-authorship of at least one article. The DBLP network has 986 324 
nodes and its average degree is 6.8. 

We carry out the following experiment: we initialize the random walk (1) at a node chosen 
according to the uniform distribution and continue the random walk until we hit the largest 
degree node. The largest degrees for the PA, UK and DBLP networks are 138, 194 955, and 
979, respectively. For the PA network we have made 10 000 experiments and for the UK and 
DBLP networks we performed 1 000 experiments (these networks were too large to perform more 
experiments). 
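The experiment described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the code used for the reported results: the function names are hypothetical, the graph is assumed to be stored as a list of neighbour lists, and a step of the walk (1) is sampled by first deciding whether to jump (with probability a/(d_i + a)) and otherwise following a uniformly chosen link, which reproduces the probabilities in (1).

```python
import random


def rw_step(adj, node, a, rng):
    """One step of the random walk with uniform jumps (1)."""
    d = len(adj[node])
    if d == 0 and a == 0:
        raise ValueError("isolated node and a = 0: the walk cannot move")
    # Jump to a uniformly chosen node with probability a / (d + a) ...
    if rng.random() * (d + a) < a:
        return rng.randrange(len(adj))
    # ... otherwise follow a uniformly chosen link of the current node.
    return rng.choice(adj[node])


def hitting_time(adj, a, target, start, rng, max_steps=10**7):
    """Number of steps until the walk started at `start` first visits `target`."""
    node, steps = start, 0
    while node != target:
        if steps >= max_steps:
            raise RuntimeError("target not reached within max_steps")
        node = rw_step(adj, node, a, rng)
        steps += 1
    return steps


def mean_hitting_time(adj, a, runs, seed=0):
    """Average hitting time of the largest degree node from uniform starts."""
    rng = random.Random(seed)
    target = max(range(len(adj)), key=lambda v: len(adj[v]))
    total = 0
    for _ in range(runs):
        start = rng.randrange(len(adj))
        total += hitting_time(adj, a, target, start, rng)
    return total / runs
```

The `max_steps` guard is an addition for safety: with a = 0 the walk may never leave a component that does not contain the target.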

In Figure 1 we plot the histograms of hitting times for the PA network. The first remarkable 
observation is that when a = 0 (no restart) the average hitting time, which is equal to 123 000, 
is nearly three orders of magnitude larger than 3 720, the hitting time when a = 2. The second 
remarkable observation is that 3 720 is not too far from the value 

$$
1/\pi_{\max}(a) = (2|E| + na)/(d_{\max} + a) = 2\,857,
$$

which corresponds to the average return time to the largest degree node in the random walk with 
jumps. 





Figure 1: Histograms of hitting times in the PA network. Panels: (a) a = 0; (b) a = 2. 



We were not able to collect a representative number of experiments for the UK and DBLP 
networks when a = 0. The reason for this is that the random walk gets stuck either in disconnected 
or weakly connected components of the networks. For the UK network we were able to make 
1 000 experiments with a = 0.001 and obtain the average hitting time 30 750. If instead we take 
a = 28.6 for the UK network, we obtain the average hitting time 5 800. Note that the expected 
return time to the largest degree node in the UK network is given by 

$$
1/\pi_{\max}(a) = (2|E| + na)/(d_{\max} + a) = 5\,432.
$$

For the DBLP graph we conducted 1 000 experiments with a = 0.00001 and obtained an average 
hitting time of 41 131. If instead we take a = 6.8, we obtain an average hitting time of 14 200. 
The expected return time to the largest degree node in the DBLP network is given by 

$$
1/\pi_{\max}(a) = (2|E| + na)/(d_{\max} + a) = 13\,607.
$$
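The three expected return times quoted above follow directly from the network sizes, average degrees and largest degrees given earlier. The short sketch below redoes the arithmetic; the function name is hypothetical, and 2|E| is approximated by n times the rounded average degree, so the last digits may differ slightly from the values in the text.

```python
def expected_return_time(n, avg_degree, d_max, a):
    """Expected return time 1/pi_max(a) = (2|E| + n a) / (d_max + a)."""
    two_E = n * avg_degree  # the sum of degrees equals 2|E|
    return (two_E + n * a) / (d_max + a)


# Network parameters as reported in the text; a is set to the average degree.
pa = expected_return_time(100_000, 2.0, 138, 2.0)
uk = expected_return_time(18_520_486, 28.6, 194_955, 28.6)
dblp = expected_return_time(986_324, 6.8, 979, 6.8)
print(round(pa), round(uk), round(dblp))
```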



The two natural network examples confirm our guess that the average hitting time for the largest 
degree node is fairly close to the average return time to the largest degree node. Let us also 
confirm our guess with asymptotic analysis. 

Theorem 1 Without loss of generality, index the nodes such that node 1 has the largest degree and 
$(1, i) \in E$ for $i = 2, \dots, s$, where $s = d_1 + 1$, and let $\nu$ denote the initial distribution of 
the random walk with jumps. Then, the expected hitting time to node 1 starting from any initial 
distribution $\nu$ is given by 

$$
E_\nu[T_1] = \frac{\sum_{i=2}^{n} d_i + (n-1)a}{d_1 + 2a(1 - 1/n)} + R, \tag{3}
$$

where R is a remainder term that depends on $\min_{i=2,\dots,s}\{(d_i + a),\, n\}$. 

Proof: The expected hitting time from distribution $\nu$ to node 1 is given by the formula 

$$
E_\nu[T_1] = \nu\,[I - P_{-1}]^{-1}\,\underline{1}, \tag{4}
$$

where $P_{-1}$ is a taboo probability matrix (i.e., matrix P with the first row and first column 
removed) and $\underline{1}$ is a column vector of ones. The matrix $P_{-1}$ is substochastic but is 
very close to stochastic. Let us represent it as a stochastic matrix minus some perturbation term: 



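$$
P_{-1} = \tilde{P} - \varepsilon Q = \tilde{P} - \mathrm{diag}\!\left(
\frac{1 + 2a/n}{d_2 + a}, \dots, \frac{1 + 2a/n}{d_s + a},
\frac{2a/n}{d_{s+1} + a}, \dots, \frac{2a/n}{d_n + a}
\right).
$$

We add missing probability mass to the diagonal of $\tilde{P}$, which corresponds to an increase in the 
weights for self-loops. The matrix $\tilde{P}$ represents a reversible Markov chain with the stationary 
distribution 

$$
\tilde{\pi}_j = \frac{d_j + a}{\sum_{i=2}^{n} d_i + (n-1)a}, \qquad j = 2, \dots, n.
$$

Now we can use the following result from the perturbation theory (see Lemma 1 in [2]): 

$$
[I - \tilde{P} + \varepsilon Q]^{-1}
= \frac{1}{\tilde{\pi}\,(\varepsilon Q)\,\underline{1}}\,\underline{1}\tilde{\pi} + X_0 + \varepsilon X_1 + \dots, \tag{5}
$$

where $\tilde{\pi}$ is the stationary distribution of the stochastic matrix $\tilde{P}$. In our case, the quantity 
$\max_{i=2,\dots,s}\{1/(d_i + a),\, 1/n\}$ will play the role of $\varepsilon$. We apply the series (5) to approximate the 
expected hitting time. Towards this goal, we calculate 

$$
\begin{aligned}
\tilde{\pi}\,(\varepsilon Q)\,\underline{1}
&= \sum_{j=2}^{s} \frac{d_j + a}{\sum_{i=2}^{n} d_i + (n-1)a}\,\frac{1 + 2a/n}{d_j + a}
 + \sum_{j=s+1}^{n} \frac{d_j + a}{\sum_{i=2}^{n} d_i + (n-1)a}\,\frac{2a/n}{d_j + a} \\
&= \frac{d_1(1 + 2a/n) + (n - d_1 - 1)(2a/n)}{\sum_{i=2}^{n} d_i + (n-1)a}
 = \frac{d_1 + 2a(1 - 1/n)}{\sum_{i=2}^{n} d_i + (n-1)a}.
\end{aligned}
$$

Observing that $\nu\,\underline{1}\,\tilde{\pi}\,\underline{1} = 1$, we obtain (3). 

□ 

Indeed, the asymptotic expression (3) is very close to $(2|E| + na)/(d_1 + a)$, which is the 
expected return time to node 1. 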
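To see how close the two quantities are in a concrete case, the leading term of (3) can be compared with the expected return time for the PA network. This is a small worked check with hypothetical function names; it assumes 2|E| = 200 000 (average degree two on 100 000 nodes), the largest degree 138 and a = 2 as reported in the text, and it ignores the remainder term of (3).

```python
def hitting_time_leading_term(n, two_E, d1, a):
    """Leading term of (3); the sum of d_i over i >= 2 equals 2|E| - d_1."""
    return (two_E - d1 + (n - 1) * a) / (d1 + 2 * a * (1 - 1 / n))


def return_time(n, two_E, d1, a):
    """Expected return time to node 1: (2|E| + n a) / (d_1 + a)."""
    return (two_E + n * a) / (d1 + a)


lead = hitting_time_leading_term(100_000, 200_000, 138, 2)
ret = return_time(100_000, 200_000, 138, 2)
relative_gap = abs(ret - lead) / ret
print(round(lead, 1), round(ret, 1), round(relative_gap, 4))
```

Under these assumptions the two values differ by roughly 1.5 percent.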

Based on the notion of the hitting time we propose an efficient method for quick detection of 
the top k list of largest degree nodes. The algorithm maintains a top k candidate list. Note that 
once one of the k nodes with the largest degrees appears in this candidate list, it remains there 
subsequently. Thus, we are interested in hitting events. We propose the following algorithm for 
detecting the top k list of largest degree nodes. 

Algorithm 1 Random walk with jumps and candidate list 

1. Set k, a and m. 

2. Execute a random walk step according to (1). 

3. Check if the current node has a larger degree than one of the nodes in the current top k 
candidate list. If it is the case, insert the new node in the top-k candidate list and remove 
the worst node out of the list. 

4. If the number of random walk steps is less than m, return to Step 2 of the algorithm. Stop, 
otherwise. 
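The four steps of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and the adjacency-list representation are assumptions, the walk is started at a uniformly chosen node, a is required to be positive so that isolated nodes cannot trap the walk, and the candidate list is kept in a plain dictionary (a heap would be preferable for large k).

```python
import random


def top_k_random_walk(adj, k, a, m, seed=None):
    """Algorithm 1: random walk with jumps and a top-k candidate list.

    adj: list of neighbour lists; k: list size; a: jump parameter (> 0);
    m: number of random walk steps. Returns candidates sorted by degree.
    """
    if a <= 0:
        raise ValueError("this sketch assumes a > 0")
    rng = random.Random(seed)
    n = len(adj)
    node = rng.randrange(n)
    candidates = {}  # node -> degree, at most k entries
    for _ in range(m):
        # Step 2: one step of the walk (1).
        d = len(adj[node])
        if rng.random() * (d + a) < a:
            node = rng.randrange(n)
        else:
            node = rng.choice(adj[node])
        # Step 3: update the candidate list with the current node.
        if node in candidates:
            continue
        deg = len(adj[node])
        if len(candidates) < k:
            candidates[node] = deg
        else:
            worst = min(candidates, key=candidates.get)
            if deg > candidates[worst]:
                del candidates[worst]
                candidates[node] = deg
    return sorted(candidates, key=candidates.get, reverse=True)
```

Only the degrees of visited nodes are read, which is where the memory savings mentioned in the introduction come from.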

The value of parameter a is not crucial. In our experiments, we have observed that as long as 
the value of a is neither too small nor too big, the algorithm performs well. A good option 
for the choice of a is a value slightly smaller than the average node degree. Let us explain this 
choice by calculating the probability of a jump in the steady state 

$$
\sum_{j=1}^{n} \pi_j(a)\,\frac{a}{d_j + a}
= \sum_{j=1}^{n} \frac{d_j + a}{2|E| + na}\,\frac{a}{d_j + a}
= \frac{na}{2|E| + na}
= \frac{a}{2|E|/n + a}.
$$

If a is equal to 2|E|/n, the average degree, the random walk will jump in the steady state on 
average every two steps. Thus, if we set a to the average degree or to a slightly smaller value, 
on one hand the random walk will quickly converge to the steady state and on the other hand 
we will not sample too much from the uniform distribution. 

The number of random walk steps, m, is a crucial parameter. Our experiments indicate that 
we obtain a top k list with many correct elements with high probability if we take the number 
of random walk steps to be twice or thrice as large as the expected hitting time of the nodes in 
the top k list. From Theorem 1 we know that the hitting time of the large degree node is related 
to the value of the node's degree. Thus, the problem of choosing m reduces to the problem of 
estimating the values of the largest degrees. We address this problem in the following section. 
3 Estimating the largest degrees in the configuration network model 

The estimations for the values of the largest degrees can be derived in the configuration network 
model [7] with a power law degree distribution. In some applications the knowledge of the power 
law parameters might be available to us. For instance, it is known that web graphs have power 
law degree distribution and we know typical ranges for the power law parameters. 

We assume that the node degrees $D_1, \dots, D_n$ are i.i.d. random variables with a power law 
distribution F and finite expectation E[D]. Let us determine the number of links contained in 
the top k nodes. Denote 

$$
F(x) = P[D \le x], \qquad \bar{F}(x) = 1 - F(x), \qquad x \ge 0.
$$

Further let $D_{(1)} \ge \dots \ge D_{(n)}$ be the order statistics of $D_1, \dots, D_n$. Under the assumption 
that the $D_j$'s obey a power law, we use results from extreme value theory to 
state that there exist sequences of constants $(a_n)$ and $(b_n)$ and a constant $\delta$ such that 

$$
\lim_{n \to \infty} n\bar{F}(a_n x + b_n) = (1 + \delta x)^{-1/\delta}. \tag{6}
$$

This implies the following approximation for high quantiles of F, with exceedance probability p 
close to zero: 

$$
x_p \approx a_n\,\frac{(pn)^{-\delta} - 1}{\delta} + b_n.
$$

For the jth largest degree, where j = 2, ..., k, the estimated exceedance probability equals 
(j - 1)/n and thus we can use the quantile $x_{(j-1)/n}$ to approximate the degree of this node: 

$$
D_{(j)} \approx a_n\,\frac{(j-1)^{-\delta} - 1}{\delta} + b_n. \tag{7}
$$



The sequences $(a_n)$ and $(b_n)$ are easy to find for a given shape of the tail of F. Below we 
derive the corresponding results for the commonly accepted Pareto tail distribution of D, that 
is, 

$$
\bar{F}(x) = C x^{-\gamma} \quad \text{for } x > x', \tag{8}
$$

where $\gamma > 1$ and x' is a fixed sufficiently large number so that the power law degree distribution 
is observed for nodes with degree larger than x'. In that case we have 

$$
\lim_{n \to \infty} n\bar{F}(a_n x + b_n)
= \lim_{n \to \infty} nC(a_n x + b_n)^{-\gamma}
= \lim_{n \to \infty} \left(C^{-1/\gamma} n^{-1/\gamma} a_n x + C^{-1/\gamma} n^{-1/\gamma} b_n\right)^{-\gamma},
$$

which directly gives (6) with 

$$
\delta = 1/\gamma, \qquad a_n = \delta C^{1/\gamma} n^{1/\gamma}, \qquad b_n = C^{1/\gamma} n^{1/\gamma}. \tag{9}
$$

Substituting (9) into (7) we obtain the following prediction for $D_{(j)}$, j = 2, ..., k, in the case of 
the Pareto tail of the degree distribution: 

$$
D_{(j)} \approx n^{1/\gamma}\left[C^{1/\gamma}(j-1)^{-1/\gamma} - C^{1/\gamma} + 1\right]. \tag{10}
$$

It remains to find an approximation for $D_{(1)}$, the maximal degree in the graph. From the 
extreme value theory it is well known that if $D_1, \dots, D_n$ obey a power law then 

$$
\lim_{n \to \infty} P\!\left[\frac{D_{(1)} - b_n}{a_n} \le x\right] = H_\delta(x) = \exp\!\left(-(1 + \delta x)^{-1/\delta}\right),
$$

where, for Pareto tail, $a_n$, $b_n$ and $\delta$ are defined in (9). Thus, as an approximation for the maximal 
node degree we can choose $a_n x + b_n$ where x can be chosen as either an expectation, a median 
or a mode of $H_\delta(x)$. If we choose the mode, $((1 + \delta)^{-\delta} - 1)/\delta$, then we obtain an approximation 
which is smaller than the one for the 2nd largest degree. Further, the expectation $(\Gamma(1 - \delta) - 1)/\delta$ 
is very sensitive to the value of $\delta = 1/\gamma$, especially when $\gamma$ is close to one, which is often the case 
in complex networks. Besides, the parameter $\gamma$ is hard to estimate with high precision. Thus, 
we choose the median $((\log 2)^{-\delta} - 1)/\delta$, which yields 

$$
D_{(1)} \approx a_n\,\frac{(\log 2)^{-\delta} - 1}{\delta} + b_n
= n^{1/\gamma}\left[C^{1/\gamma}(\log 2)^{-1/\gamma} - C^{1/\gamma} + 1\right]. \tag{11}
$$

For instance, in the PA network $\gamma = 2.5$ and C = 3.7, which gives according to (11) $D_{(1)} \approx$ 
127. (This is a good prediction even though the PA network is not generated according to the 
configuration model. We also note that even though the extremum distribution in the preferential 
attachment model is different from that of the configuration model their ranges seem to be very 
close [10].) This in turn suggests that for the PA network m should be chosen in the range 
6 000-18 000 if a = 2. As we can see from Figure 2 this is indeed a good range for the number 
of random walk steps. In the UK network $\gamma = 1.7$ and C = 90, which gives $D_{(1)} \approx$ 82 805 and 
suggests a range of 20 000-30 000 for m if a = 28.6. Figure 3 confirms that this is a good choice. 
The degree distribution of the DBLP network does not follow a power law so we cannot apply 
the above reasoning to it. 
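The two degree predictions quoted above can be reproduced by evaluating the right-hand sides of (10) and (11) as printed. The sketch below does only that; the function names are hypothetical, and the inputs are the values of n, $\gamma$ and C reported in the text.

```python
import math


def predicted_max_degree(n, gamma, C):
    """Median-based approximation (11) for the largest degree D_(1)."""
    d = 1.0 / gamma
    return n**d * (C**d * math.log(2) ** (-d) - C**d + 1)


def predicted_jth_degree(n, gamma, C, j):
    """Approximation (10) for the j-th largest degree, j >= 2."""
    d = 1.0 / gamma
    return n**d * (C**d * (j - 1) ** (-d) - C**d + 1)


pa_max = predicted_max_degree(100_000, 2.5, 3.7)      # PA network
uk_max = predicted_max_degree(18_520_486, 1.7, 90.0)  # UK network
print(round(pa_max), round(uk_max))
```

Both values agree with the predictions in the text up to rounding; the observed largest degrees reported in Section 2 are 138 for the PA network and 194 955 for the UK network.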



4 Stopping criteria 

Suppose now that we do not have any information about the range for the largest k degrees. In 
this section we design stopping criteria that do not require knowledge about the structure of the 
network. As we shall see, knowledge of the order of magnitude of the average degree might help, 
but this knowledge is not imperative for a practical implementation of the algorithm. 

Let us now assume that node j can be sampled independently with probability $\pi_j(a)$ as in 
(2). There are at least two ways to achieve this practically. The first approach is to run the 
random walk for a significant number of steps until it reaches the stationary distribution. If 
one chooses a reasonably large, say the same order of magnitude as the average degree, then 
the mixing time becomes quite small [1] and we can be sure to reach the stationary distribution 
in a small number of steps. Then, the last step of a run of the random walk will produce an 
i.i.d. sample from a distribution very close to (2). The second approach is to run the random 
walk uninterruptedly, also with a significant value of a, and then perform Bernoulli sampling 
with probability q after a small initial transient phase. If q is not too large, we shall have nearly 
independent samples following the stationary distribution (2). In our experiments, q in [0.2, 0.5] 
gives good results when a has the same order of magnitude as the average degree. 

We now estimate the probability of detecting correctly the top k list of nodes after m i.i.d. 
samples from (2). Denote by $X_i$ the number of hits at node i after m i.i.d. samples. We note 
that if we use the second approach to generate i.i.d. samples, we spend approximately m/q 
steps of the random walk. We correctly detect the top k list with the probability given by the 
multinomial distribution 

$$
P[X_1 \ge 1, \dots, X_k \ge 1] =
\sum_{\substack{i_1 \ge 1, \dots, i_k \ge 1 \\ i_1 + \dots + i_k \le m}}
\frac{m!}{i_1! \cdots i_k!\,(m - i_1 - \dots - i_k)!}\;
\pi_1^{i_1} \cdots \pi_k^{i_k}
\Bigl(1 - \sum_{i=1}^{k} \pi_i\Bigr)^{m - i_1 - \dots - i_k},
$$
but it is not feasible for any realistic computations. Therefore, we propose to use the Poisson 
approximation. Let $Y_j$, j = 1, ..., n, be independent Poisson random variables with means 
$\pi_j m$. That is, the random variable $Y_j$ has the following probability mass function 
$P[Y_j = r] = e^{-m\pi_j}(m\pi_j)^r / r!$. It is convenient to work with the complementary event of not 
detecting correctly the top k list. Then, we have 

$$
\begin{aligned}
P[\{X_1 = 0\} \cup \dots \cup \{X_k = 0\}]
&\le 2P[\{Y_1 = 0\} \cup \dots \cup \{Y_k = 0\}] \\
&= 2\bigl(1 - P[\{Y_1 \ge 1\} \cap \dots \cap \{Y_k \ge 1\}]\bigr)
 = 2\Bigl(1 - \prod_{j=1}^{k} P[Y_j \ge 1]\Bigr) \\
&= 2\Bigl(1 - \prod_{j=1}^{k} \bigl(1 - P[Y_j = 0]\bigr)\Bigr)
 = 2\Bigl(1 - \prod_{j=1}^{k} \bigl(1 - e^{-m\pi_j}\bigr)\Bigr) =: a,
\end{aligned}
\tag{12}
$$

where the first inequality follows from [12, Thm 5.10]. In the remainder of this section a 
without further qualification denotes this bound on the error probability, as distinct from 
the jump parameter of the random walk. In fact, in our numerical experiments we 
observed that the factor 2 in the first inequality is very conservative. For large values of m, the 
Poisson bound works very well as a proper approximation. 
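For a small k the left-hand side of (12) can be evaluated exactly by inclusion-exclusion, since under i.i.d. sampling the probability that none of the nodes in a set S is hit in m samples is $(1 - \sum_{j \in S} \pi_j)^m$. The sketch below compares this exact value with the Poisson bound; the function names and the stationary probabilities in the example are hypothetical, and the subset enumeration is only practical for small k.

```python
import math
from itertools import combinations


def poisson_error_bound(m, top_pis):
    """Right-hand side of (12): 2 * (1 - prod_j (1 - exp(-m * pi_j)))."""
    prod = 1.0
    for p in top_pis:
        prod *= 1.0 - math.exp(-m * p)
    return 2.0 * (1.0 - prod)


def exact_miss_probability(m, top_pis):
    """P[some top-k node has zero hits in m i.i.d. samples], by inclusion-exclusion."""
    total = 0.0
    for r in range(1, len(top_pis) + 1):
        sign = 1.0 if r % 2 == 1 else -1.0
        for subset in combinations(top_pis, r):
            total += sign * (1.0 - sum(subset)) ** m
    return total


# Hypothetical stationary probabilities of the top-2 nodes and m = 1000 samples.
pis = [0.01, 0.005]
bound = poisson_error_bound(1000, pis)
exact = exact_miss_probability(1000, pis)
print(exact, bound)
```

In this example the exact probability is about half of the bound, in line with the remark that the factor 2 is conservative.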

For example, if we would like to obtain the top 10 list with at most 10% probability of error, 
we need to have on average 4.5 hits per each top element. This can be used to design the stopping 
criteria for our random walk algorithm. Let $\bar{a} \in (0, 1)$ be the admissible probability of an error 
in the top k list. Now the idea is to stop the algorithm after m steps when the estimated value 
of the bound a in (12) for the first time is lower than the critical number $\bar{a}$. Clearly, 

$$
\hat{a}_m = 2\Bigl(1 - \prod_{j=1}^{k} \bigl(1 - e^{-X_j}\bigr)\Bigr)
$$

is the maximum likelihood estimator for a, so we would like to choose m such that $\hat{a}_m \le \bar{a}$. The 
problem, however, is that we do not know which $X_j$'s are the realisations of the number of visits 
to the top k nodes. Then let $X_{j_1}, \dots, X_{j_k}$ be the number of hits to the current elements in the 
top k candidate list and consider the estimator 

$$
\hat{a}_{m,0} = 2\Bigl(1 - \prod_{i=1}^{k} \bigl(1 - e^{-X_{j_i}}\bigr)\Bigr),
$$

which is the maximum likelihood estimator of the quantity 

$$
2\Bigl(1 - \prod_{i=1}^{k} \bigl(1 - e^{-m\pi_{j_i}}\bigr)\Bigr) \ge a.
$$

(Here $\pi_{j_i}$ is a stationary probability of the node with the score $X_{j_i}$, i = 1, ..., k.) The estimator 
$\hat{a}_{m,0}$ is computed without knowledge of the top k nodes or their degrees, and it is an estimator 
of an upper bound of the estimated probability that there are errors in the top k list. This leads 
to the following stopping rule. 

Stopping rule 0. Stop at $m = m_0$, where 

$$
m_0 = \arg\min\{m : \hat{a}_{m,0} \le \bar{a}\}.
$$
The above stopping criterion can be simplified even further to avoid computation of $\hat{a}_{m,0}$. 
Since 

$$
\hat{a}_{m,1} := 2\bigl(1 - (1 - e^{-X_{j_k}})^k\bigr) \ge \hat{a}_{m,0} \ge \hat{a}_m,
$$

where $X_{j_k}$ is the number of hits of the worst element in the candidate list, the inequality 
$\hat{a}_m \le \bar{a}$ is guaranteed if $\hat{a}_{m,1} \le \bar{a}$, with $\bar{a}$ the admissible probability of an error. 
This leads to the following stopping rule for the random walk algorithm. 

Stopping rule 1. Compute $x_0 = \arg\min\{x \in \mathbb{N} : (1 - e^{-x})^k \ge 1 - \bar{a}/2\}$. Stop at 

$$
m_1 = \arg\min\{m : X_{j_k} = x_0\}.
$$
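The two stopping rules reduce to a few lines once the hit counts of the current candidate list are available. The sketch below is illustrative only: the function names are hypothetical, `candidate_hits` is assumed to hold the k hit counts of the candidates, and the threshold search simply tries x = 1, 2, ... until the condition of Stopping rule 1 holds.

```python
import math


def a_hat_m0(candidate_hits):
    """Estimator used in Stopping rule 0: 2 * (1 - prod_i (1 - exp(-X_{j_i})))."""
    prod = 1.0
    for x in candidate_hits:
        prod *= 1.0 - math.exp(-x)
    return 2.0 * (1.0 - prod)


def a_hat_m1(candidate_hits):
    """Simplified estimator: all hit counts are replaced by the smallest one."""
    k = len(candidate_hits)
    worst = min(candidate_hits)
    return 2.0 * (1.0 - (1.0 - math.exp(-worst)) ** k)


def hit_threshold(k, a_bar):
    """Smallest positive integer x with (1 - exp(-x))**k >= 1 - a_bar / 2."""
    if not 0.0 < a_bar < 1.0:
        raise ValueError("a_bar must lie in (0, 1)")
    x = 1
    while (1.0 - math.exp(-x)) ** k < 1.0 - a_bar / 2.0:
        x += 1
    return x


def should_stop_rule0(candidate_hits, a_bar):
    """Stopping rule 0: stop as soon as the estimator drops to a_bar or below."""
    return a_hat_m0(candidate_hits) <= a_bar
```

With Stopping rule 1 only the hit count of the worst candidate has to be compared with the precomputed threshold, so no exponentials are evaluated during the walk.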

We have observed in our numerical experiments that we obtain the best trade off between the 
number of steps of the random walk and the accuracy if we take the jump parameter a around the 
average degree and the sampling probability q around 0.5. Specifically, if we take 
$\bar{a}/2 = 0.15$ ($x_0 = 4$) in Stopping 
rule 1 for top 10 list, we obtain 87% accuracy for an average of 47 000 random walk steps for the 
PA network; 92% accuracy for an average of 174 468 random walk steps for the DBLP network; 
and 94% accuracy for an average of 247 166 random walk steps for the UK network. We have 
averaged over 1000 experiments to obtain tight confidence intervals. 

5 Relaxation of top k lists 

In the stopping criteria of the previous section we have strived to detect all nodes in the top k 
list. This costs us a lot of steps of the random walk. We can significantly gain in performance by 
relaxing this strict requirement. For instance, we could just ask for a list of k nodes that contains 
80% of the top k nodes. This way we can take advantage of the generic 80/20 rule that 80% of 
the result can be achieved with 20% of the effort. 

Let us calculate the expected number of top k elements observed in the candidate list up to 
trial m. Define by $X_j$ the number of times we have observed node j after m trials and 

$$
H_j =
\begin{cases}
1, & \text{node } j \text{ has been observed at least once,} \\
0, & \text{node } j \text{ has not been observed.}
\end{cases}
$$

Assuming we sample in i.i.d. fashion from the distribution (2), we can write 

$$
E\Bigl[\sum_{j=1}^{k} H_j\Bigr] = \sum_{j=1}^{k} E[H_j] = \sum_{j=1}^{k} P[X_j \ge 1]
= \sum_{j=1}^{k} \bigl(1 - P[X_j = 0]\bigr) = \sum_{j=1}^{k} \bigl(1 - (1 - \pi_j)^m\bigr). \tag{13}
$$

In Figure 2 we plot $E[\sum_{j=1}^{k} H_j]$ (the curve "I.I.D. sample") as a function of m for k = 10 for 
the PA network with a = 0 and a = 2. In Figure 3 we plot $E[\sum_{j=1}^{k} H_j]$ as a function of m for 
k = 10 for the UK network with a = 0.001 and a = 28.6. The results for the UK and DBLP 
networks are similar in spirit. 

Here again we can use the Poisson approximation 

$$
E\Bigl[\sum_{j=1}^{k} H_j\Bigr] \approx \sum_{j=1}^{k} \bigl(1 - e^{-m\pi_j}\bigr).
$$

In fact, the Poisson approximation is so good that if we plot it on Figures 2 and 3 it nearly covers 
exactly the curves labeled "I.I.D. sample", which correspond to the exact formula (13). 

Figure 2: Average number of correctly detected elements in top-10 for PA. Panels: (a) a = 0; (b) a = 2. 

Figure 3: Average number of correctly detected elements in top-10 for UK. Panels: (a) a = 0.001; (b) a = 28.6. 

Similarly to the previous section, we can propose stopping criteria based on the Poisson 
approximation. Denote 

$$
b_m = \sum_{i=1}^{k} \bigl(1 - e^{-X_{j_i}}\bigr),
$$

where $X_{j_1}, \dots, X_{j_k}$ are the numbers of hits to the current elements of the candidate list, and 
let $\bar{b}$ be a chosen threshold. 

Stopping rule 2. Stop at $m = m_2$, where 

$$
m_2 = \arg\min\{m : b_m \ge \bar{b}\}.
$$
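The exact expectation (13), its Poisson approximation and the statistic used in Stopping rule 2 are simple sums, as the following sketch shows. The function names and the three stationary probabilities in the example are hypothetical; `candidate_hits` is assumed to hold the hit counts of the current candidate list.

```python
import math


def expected_detected_exact(m, top_pis):
    """Formula (13): sum_j (1 - (1 - pi_j)**m)."""
    return sum(1.0 - (1.0 - p) ** m for p in top_pis)


def expected_detected_poisson(m, top_pis):
    """Poisson approximation: sum_j (1 - exp(-m * pi_j))."""
    return sum(1.0 - math.exp(-m * p) for p in top_pis)


def b_m(candidate_hits):
    """Statistic of Stopping rule 2: sum_i (1 - exp(-X_{j_i}))."""
    return sum(1.0 - math.exp(-x) for x in candidate_hits)


def should_stop_rule2(candidate_hits, b_bar):
    """Stopping rule 2: stop as soon as b_m reaches the threshold b_bar."""
    return b_m(candidate_hits) >= b_bar


# Hypothetical stationary probabilities of three top nodes, m = 1000 samples.
pis = [0.002, 0.001, 0.0005]
exact = expected_detected_exact(1000, pis)
approx = expected_detected_poisson(1000, pis)
print(exact, approx)
```

In this example the two values differ only in the fourth decimal place, which is consistent with the observation that the Poisson curve nearly coincides with the exact one.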

Now if we take $\bar{b} = 7$ in Stopping rule 2 for top-10 list, we obtain on average 8.89 correct 
elements for an average of 16 725 random walk steps for the PA network; we obtain on average 
9.28 correct elements for an average of 66 860 random walk steps for the DBLP network; and 
we obtain on average 9.22 correct elements for an average of 65 802 random walk steps for the 
UK network. (We have averaged over 1000 experiments for each network.) This makes for the 
UK network the gain of more than two orders of magnitude in computational complexity with 
respect to the deterministic algorithm. 
6 Conclusions and future research 

We have proposed the random walk method with the candidate list for quick detection of largest 
degree nodes. We have also supplied stopping criteria which do not require knowledge of the 
graph structure. In the case of large networks, our algorithm finds a top k list of largest degree 
nodes with few mistakes and with a running time orders of magnitude faster than the deterministic 
sorting algorithm. In future research we plan to obtain estimates for the required number of 
steps for various types of complex networks. 